Lab3: Classification
Logistic regression and k-nearest neighbors
Pima Indians Diabetes Database
diabetes: test for diabetes (neg / pos)
pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-hour serum insulin (\(\mu\)U/ml)
mass: Body mass index (weight in kg/\(\text{(height in m)}^2\))
pedigree: Diabetes pedigree function
age: Age (years)

[1] 768 9
pregnant glucose pressure triceps insulin mass pedigree age diabetes
1 6 148 72 35 0 33.6 0.627 50 pos
2 1 85 66 29 0 26.6 0.351 31 neg
3 8 183 64 0 0 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.288 33 pos
6 5 116 74 0 0 25.6 0.201 30 neg
neg pos
0.65104 0.34896
'data.frame': 768 obs. of 9 variables:
$ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
$ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
$ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
$ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
$ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
$ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
$ age : num 50 31 32 21 33 30 26 29 53 54 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
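The output above can be reproduced by loading the PimaIndiansDiabetes data from the mlbench package and splitting it into training and test sets. The stratified 80/20 split via createDataPartition and the seed are assumptions; only the data set and the printed summaries come from the lab.

```r
# Load the Pima Indians Diabetes data (mlbench package)
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)                          # 768 observations, 9 variables
prop.table(table(PimaIndiansDiabetes$diabetes))   # class proportions: neg / pos

# Stratified 80/20 train/test split (exact split and seed are assumptions)
set.seed(1)
idx <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.8, list = FALSE)
training <- PimaIndiansDiabetes[idx, ]
testing  <- PimaIndiansDiabetes[-idx, ]
str(training)
```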
Arguments
family: a description of the error distribution and link function to be used in the model.
binomial(link = "logit")
gaussian(link = "identity")
poisson(link = "log")
quasipoisson(link = "log")
See ?family for more family functions.

Control parameters (see ?glm.control):
epsilon = 1e-8: positive convergence tolerance \(\epsilon\); the iterations converge when \(\frac{|dev - dev_{old}|}{|dev| + 0.1} < \epsilon\).
maxit = 25: maximal number of IWLS iterations.
trace = FALSE: logical indicating if output should be produced for each iteration.

Stepwise Model Path
Analysis of Deviance Table
Initial Model:
diabetes ~ pregnant + glucose + pressure + triceps + insulin +
mass + pedigree + age
Final Model:
diabetes ~ pregnant + glucose + pressure + mass + pedigree
Step Df Deviance Resid. Df Resid. Dev AIC
1 605 579.85 597.85
2 - triceps 1 0.058537 606 579.91 595.91
3 - insulin 1 0.231721 607 580.14 594.14
4 - age 1 0.309372 608 580.45 592.45
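The stepwise path above can be produced by fitting the full logistic regression with glm() and then running AIC-based backward elimination with step(); the `training` data frame and object names are assumptions.

```r
# Full logistic regression model on the training set
full.fit <- glm(diabetes ~ pregnant + glucose + pressure + triceps +
                  insulin + mass + pedigree + age,
                family = binomial, data = training)

# Backward stepwise selection by AIC; trace = TRUE prints the model path
step.fit <- step(full.fit, direction = "backward", trace = TRUE)
summary(step.fit)
```

Each step drops the predictor whose removal lowers the AIC the most, stopping when no removal improves it; here triceps, insulin, and age are eliminated in turn.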
Call:
glm(formula = diabetes ~ pregnant + glucose + pressure + mass +
pedigree, family = binomial, data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.18825 0.76804 -10.66 < 2e-16 ***
pregnant 0.15893 0.03090 5.14 2.7e-07 ***
glucose 0.03621 0.00385 9.41 < 2e-16 ***
pressure -0.01269 0.00553 -2.29 0.02182 *
mass 0.08477 0.01556 5.45 5.1e-08 ***
pedigree 1.16449 0.33827 3.44 0.00058 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 805.74 on 613 degrees of freedom
Residual deviance: 580.45 on 608 degrees of freedom
AIC: 592.5
Number of Fisher Scoring iterations: 5
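The estimates above are on the log-odds scale; exponentiating gives odds ratios. For example, each additional pregnancy multiplies the odds of diabetes by exp(0.15893) ≈ 1.17, holding the other predictors fixed. A short sketch (assuming the stepwise model is stored as `step.fit`):

```r
# Odds ratios and Wald 95% confidence intervals for the selected model
exp(coef(step.fit))
exp(confint.default(step.fit))   # Wald intervals on the odds-ratio scale
```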
Arguments
type
"link" (default) : scale of the linear predictors. \(log(\frac{P(Y=1)}{1-P(Y=1)}) = X\beta\)"response": scale of the response variable. \(P(Y=1) = \frac{exp(X\beta)}{1+exp(X\beta)}\)se.fit: logical switch indicating if standard errors are required. 1 3 12 20 37 39
0.78802 1.65395 2.39064 -1.20540 0.89490 -1.65133
1 3 12 20 37 39
0.68741 0.83942 0.91611 0.23052 0.70990 0.16093
$fit
1 3 12 20 37 39
0.68741 0.83942 0.91611 0.23052 0.70990 0.16093
$se.fit
1 3 12 20 37 39
0.033992 0.043821 0.021990 0.026913 0.048066 0.027057
$residual.scale
[1] 1
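The three outputs above correspond to the three prediction modes. A sketch of the calls (the `testing` data frame and `step.fit` object names are assumptions; the printed row indices suggest predictions on held-out observations):

```r
# Linear-predictor (log-odds) scale -- the default
predict(step.fit, newdata = testing, type = "link")

# Probability scale: P(Y = "pos" | x)
predict(step.fit, newdata = testing, type = "response")

# Probabilities with pointwise standard errors ($fit, $se.fit, $residual.scale)
predict(step.fit, newdata = testing, type = "response", se.fit = TRUE)
```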
"lpsa ~ .": Create a design matrix for all variables except for lpsa."[, -1]": Exclude the column of intercepts. (The intercept term is fitted by default in glmnet)Arguments
family:
"gaussian"(default), "binomial", "poisson", "multinomial", "cox", "mgaussian".glm() family object. (See ?family)Multinomial logistic regression
family = "multinomial"type.multinomial = "grouped" : a grouped lasso penalty is used on the multinomial coefficients for a variable to ensure they are all together. The default is "ungrouped".9 x 5 sparse Matrix of class "dgCMatrix"
s1 s2 s3 s4 s5
(Intercept) -7e+00 -5e+00 -2e+00 -8e-01 -6e-01
pregnant 1e-01 7e-02 2e-02 4e-03 5e-04
glucose 3e-02 2e-02 5e-03 9e-04 1e-04
pressure -9e-03 -3e-03 4e-04 2e-04 3e-05
triceps 2e-03 2e-03 2e-03 4e-04 6e-05
insulin 4e-05 6e-04 4e-04 9e-05 1e-05
mass 7e-02 4e-02 1e-02 2e-03 3e-04
pedigree 9e-01 6e-01 2e-01 3e-02 5e-03
age 1e-02 1e-02 6e-03 1e-03 2e-04
9 x 5 sparse Matrix of class "dgCMatrix"
s1 s2 s3 s4 s5
(Intercept) -8e+00 -7.712 -6.938 -5.10 -2.16
pregnant 1e-01 0.132 0.107 0.05 .
glucose 4e-02 0.034 0.031 0.02 0.01
pressure -1e-02 -0.009 -0.003 . .
triceps . . . . .
insulin -2e-04 . . . .
mass 8e-02 0.075 0.060 0.04 .
pedigree 1e+00 0.975 0.713 0.19 .
age 4e-03 0.003 . . .
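The two coefficient matrices above come from a ridge fit (alpha = 0, first matrix: all coefficients shrink toward zero but stay nonzero) and a lasso fit (alpha = 1, second matrix: coefficients are set exactly to zero as \(\lambda\) grows). A sketch of how they could be produced; the specific lambda values passed to coef() are assumptions:

```r
library(glmnet)

# Design matrix without the intercept column; factor response
x <- model.matrix(diabetes ~ ., training)[, -1]
y <- training$diabetes

# Ridge (alpha = 0) and lasso (alpha = 1) logistic regression paths
ridge.fit <- glmnet(x, y, family = "binomial", alpha = 0)
lasso.fit <- glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients evaluated at five penalty values (lambda grid is an assumption)
coef(ridge.fit, s = c(0.001, 0.01, 0.1, 1, 10))
coef(lasso.fit, s = c(0.001, 0.01, 0.1, 1, 10))
```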
ridge.cv <- cv.glmnet(x = x, y = y, alpha = 0,
family = "binomial",
type.measure = "deviance",
nfolds = 10)
ridge.cv
Call: cv.glmnet(x = x, y = y, type.measure = "deviance", nfolds = 10, alpha = 0, family = "binomial")
Measure: Binomial Deviance
Lambda Index Measure SE Nonzero
min 0.0229 100 0.977 0.0480 8
1se 0.1223 82 1.021 0.0368 8
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -7.1333e+00
pregnant 1.1527e-01
glucose 2.9295e-02
pressure -9.2739e-03
triceps 1.7856e-03
insulin 3.7511e-05
mass 6.9554e-02
pedigree 9.2749e-01
age 9.6757e-03
lasso.cv <- cv.glmnet(x = x, y = y, alpha = 1,
family = "binomial",
type.measure = "deviance",
nfolds = 10)
lasso.cv
Call: cv.glmnet(x = x, y = y, type.measure = "deviance", nfolds = 10, alpha = 1, family = "binomial")
Measure: Binomial Deviance
Lambda Index Measure SE Nonzero
min 0.0073 38 0.972 0.0456 6
1se 0.0430 19 1.013 0.0275 4
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -7.6665655
pregnant 0.1310709
glucose 0.0334065
pressure -0.0085941
triceps .
insulin .
mass 0.0743105
pedigree 0.9603272
age 0.0030311
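Predictions from the cross-validated fits above select \(\lambda\) through the s argument. A sketch, assuming a test-set design matrix `x.test` built the same way as `x`:

```r
# Test-set probabilities at the two standard lambda choices
x.test <- model.matrix(diabetes ~ ., testing)[, -1]

predict(lasso.cv, newx = x.test, s = "lambda.1se", type = "response")
predict(lasso.cv, newx = x.test, s = "lambda.min", type = "response")
```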
Arguments
type.measure: loss to use for cross-validation.
"default" : MSE for gaussian models, deviance for logistic and poisson regressions, and partial-likelihood for the Cox model."mse"/"mae": Mean squared/absolute error for all models except the “Cox”."deviance": Deviance for logistic and poisson regressions."class": Misclassification error for binomial and multinomial logistic regressions."auc": Area under the ROC curve for two-class logistic regression."C": Harrel’s concordance measure for Cox models.s: Value(s) of the penalty parameter \(\lambda\).
"lambda.1se" (default): Largest value of \(\lambda\) such that error is within 1 standard error of the minimum."lambda.min": Value of \(\lambda\) that gives the minimum mean cross-validated error.Arguments
method: the resampling method.
"boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV""LGOCV": for repeated training/test splits"none": only fits one model to the entire training set"oob": for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models"timeslice", "adaptive_cv", "adaptive_boot", "adaptive_LGOCV"summaryFunction: a function to compute performance metrics across resamples.
twoClassSummary: sensitivity, specificity, area under the ROC curve.prSummary: precision, recall, area under the precision-recall curve.multiClassSummary: computes some overall measures of for performance and several averages of statistics calculated from “one-versus-all” configurations.elnet.tune <- train(
diabetes ~ .,
data = training,
method = "glmnet",
trControl = cv10,
tuneLength = 10,
# Specify which metric to optimize
metric = "ROC"
)
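The control object `cv10` used above is not defined in this excerpt. Given the printed resampling summary (10-fold CV repeated 5 times, ROC metric), it could look like this; the exact settings are assumptions:

```r
library(caret)

# 10-fold cross-validation, repeated 5 times, scored with twoClassSummary
# (classProbs = TRUE is required because metric = "ROC" needs probabilities)
cv10 <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
```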
elnet.tune

glmnet
614 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 553, 553, 553, 552, 553, 552, ...
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0.1 0.00024468 0.83332 0.87077 0.5971542
0.1 0.00056524 0.83332 0.87077 0.5971542
0.1 0.00130578 0.83337 0.87077 0.5971542
0.1 0.00301653 0.83326 0.87128 0.5953755
0.1 0.00696858 0.83366 0.87179 0.5900791
0.1 0.01609832 0.83347 0.87590 0.5748221
0.1 0.03718921 0.83234 0.88872 0.5505534
0.1 0.08591190 0.83097 0.90872 0.5156522
0.1 0.19846764 0.82851 0.93487 0.4469960
0.2 0.00024468 0.83348 0.87077 0.5971542
0.2 0.00056524 0.83348 0.87077 0.5971542
0.2 0.00130578 0.83348 0.87077 0.5971542
0.2 0.00301653 0.83328 0.87128 0.5944664
0.2 0.00696858 0.83363 0.87179 0.5874308
0.2 0.01609832 0.83370 0.87795 0.5703953
0.2 0.03718921 0.83290 0.89282 0.5477866
0.2 0.08591190 0.83147 0.91641 0.5182213
0.2 0.19846764 0.83031 0.95077 0.4264822
0.3 0.00024468 0.83348 0.87077 0.5971542
0.3 0.00056524 0.83348 0.87077 0.5971542
0.3 0.00130578 0.83353 0.87077 0.5971542
0.3 0.00301653 0.83353 0.87077 0.5935968
0.3 0.00696858 0.83399 0.87333 0.5865217
0.3 0.01609832 0.83399 0.87692 0.5676680
0.3 0.03718921 0.83336 0.89641 0.5406719
0.3 0.08591190 0.83222 0.92154 0.5092490
0.3 0.19846764 0.82511 0.96359 0.3575889
0.4 0.00024468 0.83355 0.87077 0.5971542
0.4 0.00056524 0.83355 0.87077 0.5971542
0.4 0.00130578 0.83360 0.87077 0.5971542
0.4 0.00301653 0.83383 0.87077 0.5935968
0.4 0.00696858 0.83408 0.87333 0.5865217
0.4 0.01609832 0.83412 0.87949 0.5677075
0.4 0.03718921 0.83313 0.89692 0.5397233
0.4 0.08591190 0.83212 0.92462 0.5083399
0.4 0.19846764 0.81307 0.97128 0.3023320
0.5 0.00024468 0.83369 0.87077 0.5971542
0.5 0.00056524 0.83369 0.87077 0.5971542
0.5 0.00130578 0.83364 0.87128 0.5953755
0.5 0.00301653 0.83390 0.87077 0.5927273
0.5 0.00696858 0.83397 0.87385 0.5856522
0.5 0.01609832 0.83450 0.88308 0.5650198
0.5 0.03718921 0.83312 0.89692 0.5343083
0.5 0.08591190 0.83050 0.92513 0.5057708
0.5 0.19846764 0.80200 0.97436 0.2534387
0.6 0.00024468 0.83378 0.87077 0.5962846
0.6 0.00056524 0.83378 0.87077 0.5962846
0.6 0.00130578 0.83362 0.87128 0.5953755
0.6 0.00301653 0.83403 0.87179 0.5918577
0.6 0.00696858 0.83399 0.87538 0.5856522
0.6 0.01609832 0.83500 0.88462 0.5640711
0.6 0.03718921 0.83339 0.89692 0.5316601
0.6 0.08591190 0.82736 0.92359 0.5030435
0.6 0.19846764 0.79124 0.97795 0.2035573
0.7 0.00024468 0.83376 0.87077 0.5962846
0.7 0.00056524 0.83376 0.87077 0.5962846
0.7 0.00130578 0.83359 0.87128 0.5953755
0.7 0.00301653 0.83415 0.87179 0.5918577
0.7 0.00696858 0.83444 0.87487 0.5829644
0.7 0.01609832 0.83519 0.88615 0.5613439
0.7 0.03718921 0.83392 0.89692 0.5316996
0.7 0.08591190 0.82206 0.92564 0.4781028
0.7 0.19846764 0.78866 0.98718 0.1294862
0.8 0.00024468 0.83369 0.87077 0.5962846
0.8 0.00056524 0.83369 0.87077 0.5962846
0.8 0.00130578 0.83366 0.87128 0.5953755
0.8 0.00301653 0.83426 0.87128 0.5909881
0.8 0.00696858 0.83469 0.87538 0.5784585
0.8 0.01609832 0.83505 0.88667 0.5569170
0.8 0.03718921 0.83355 0.89744 0.5307510
0.8 0.08591190 0.81532 0.92769 0.4638735
0.8 0.19846764 0.78850 0.99538 0.0490514
0.9 0.00024468 0.83373 0.87128 0.5962846
0.9 0.00056524 0.83373 0.87128 0.5962846
0.9 0.00130578 0.83382 0.87077 0.5945059
0.9 0.00301653 0.83442 0.87282 0.5900791
0.9 0.00696858 0.83478 0.87590 0.5775494
0.9 0.01609832 0.83546 0.88821 0.5541502
0.9 0.03718921 0.83320 0.89897 0.5271937
0.9 0.08591190 0.80958 0.93179 0.4504743
0.9 0.19846764 0.78850 1.00000 0.0027273
1.0 0.00024468 0.83373 0.87128 0.5962846
1.0 0.00056524 0.83373 0.87128 0.5962846
1.0 0.00130578 0.83392 0.87026 0.5945059
1.0 0.00301653 0.83442 0.87282 0.5892095
1.0 0.00696858 0.83515 0.87692 0.5775099
1.0 0.01609832 0.83496 0.88821 0.5541107
1.0 0.03718921 0.83204 0.89897 0.5227273
1.0 0.08591190 0.80302 0.93590 0.4264822
1.0 0.19846764 0.78850 1.00000 0.0000000
ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.9 and lambda = 0.016098.
Arguments
method:
See train Models by Tag: http://topepo.github.io/caret/train-models-by-tag.html

preProcess: a string vector that defines a pre-processing of the predictor data.
"BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica" and "spatialSign". The default is no pre-processing.

   alpha   lambda
78   0.9 0.016098
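The alpha/lambda row above is the winning tuning combination; the `elnet.optim` object used in the refit below can be pulled directly from the train object (a sketch):

```r
# Best (alpha, lambda) pair found by the grid search
elnet.optim <- elnet.tune$bestTune
elnet.optim
```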
elnet <- glmnet(x = x, y = y, family = "binomial",
alpha = elnet.optim$alpha,
lambda = elnet.optim$lambda)
coef(elnet)

9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -7.09642400
pregnant 0.11353076
glucose 0.03111434
pressure -0.00439644
triceps .
insulin .
mass 0.06384842
pedigree 0.78410701
age 0.00099008
knn.tune.1 <- train(
diabetes ~ .,
data = training,
method = "knn",
trControl = cv10,
tuneGrid = data.frame(k = seq(3, 31, 2)),
metric = "ROC"
)
knn.tune.2 <- train(
diabetes ~ .,
data = training,
method = "knn",
preProcess = c("center", "scale"),
trControl = cv10,
tuneGrid = data.frame(k = seq(3, 31, 2)),
metric = "ROC"
)

Logistic regression with stepwise selection
?predict.glm
type: "link", "response", "terms"

Logistic regression with regularization

?predict.glmnet
type: "link", "response", "coefficients", "nonzero", "class"

k-NN

?predict.train
type: "raw", "prob"

pred.knn.1 pred.knn.2 pred.elnet pred.step
0.75974 0.76623 0.77273 0.77273
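The accuracy comparison above could be produced along these lines. Object names and the 0.5 probability threshold for the stepwise model are assumptions; each model's class predictions on the test set are compared with the true labels:

```r
# Class predictions from each model on the test set
x.test <- model.matrix(diabetes ~ ., testing)[, -1]

pred.step  <- ifelse(predict(step.fit, testing, type = "response") > 0.5,
                     "pos", "neg")
pred.elnet <- predict(elnet, newx = x.test, type = "class")
pred.knn.1 <- predict(knn.tune.1, newdata = testing)
pred.knn.2 <- predict(knn.tune.2, newdata = testing)

# Test-set accuracy of each classifier
sapply(list(pred.knn.1 = pred.knn.1, pred.knn.2 = pred.knn.2,
            pred.elnet = pred.elnet, pred.step = pred.step),
       function(p) mean(p == testing$diabetes))
```

Note that centering and scaling the predictors (knn.tune.2) matters for k-NN because its Euclidean distances are otherwise dominated by large-scale variables such as insulin.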